Adaptive Object Detection for Indoor Navigation Assistance: A Performance Evaluation of Real-Time Algorithms
Pratap, Abhinav, Kumar, Sushant, Chakravarty, Suchinton
This study addresses the critical need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We systematically evaluate the performance of four prominent real-time object detection algorithms (YOLO, SSD, Faster R-CNN, and Mask R-CNN) within the context of indoor navigation assistance. Our analysis, conducted on the Indoor Objects Detection dataset, focuses on key parameters including detection accuracy, processing speed, and adaptability to the unique challenges of indoor environments. This research contributes to a deeper understanding of adaptive machine learning applications that can significantly improve indoor navigation solutions for the visually impaired, promoting inclusivity and accessibility.
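Comparisons of detection accuracy across such algorithms typically score predicted boxes against ground truth with intersection-over-union (IoU), the building block behind mAP. A minimal sketch (the function name is illustrative, not from the study):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5; mAP then averages precision over thresholds and classes.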
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [19], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image.
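The "predicts object bounds and objectness scores at each position" step can be pictured as a small head sliding over the shared feature map: per spatial position, k objectness logits and 4k box-regression deltas, one set per anchor. A toy NumPy sketch under assumed shapes (weights and k are illustrative; the real RPN uses a 3x3 conv followed by two 1x1 convs):

```python
import numpy as np

def rpn_head(features, w_cls, w_reg, k):
    """Toy RPN head: per-position objectness logits and box deltas.

    features: (H, W, C) shared convolutional feature map
    w_cls:    (C, k)    -> k objectness logits per position (one per anchor)
    w_reg:    (C, 4*k)  -> 4 box-regression deltas per anchor
    """
    h, w, c = features.shape
    flat = features.reshape(-1, c)            # one feature vector per position
    objectness = flat @ w_cls                 # (H*W, k) logits
    deltas = flat @ w_reg                     # (H*W, 4k) box offsets
    return objectness.reshape(h, w, k), deltas.reshape(h, w, k, 4)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 16))
scores, deltas = rpn_head(feat, rng.normal(size=(16, 9)), rng.normal(size=(16, 36)), k=9)
```

Because the same weights are applied at every position, the head is fully convolutional and adds almost no cost on top of the shared backbone features, which is what makes the proposals "nearly cost-free".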
Object Detection in Aerial Images in Scarce Data Regimes
Most contributions on Few-Shot Object Detection (FSOD) evaluate their methods on natural images only, yet the transferability of the announced performance is not guaranteed for applications on other kinds of images. We demonstrate this with an in-depth analysis of existing FSOD methods on aerial images and observe a large performance gap compared to natural images. Small objects, more numerous in aerial images, are the cause of the apparent performance gap between natural and aerial images. As a consequence, we improve FSOD performance on small objects with a carefully designed attention mechanism. In addition, we propose a scale-adaptive box similarity criterion that improves the training and evaluation of FSOD methods, particularly for small objects. We also contribute to generic FSOD with two distinct approaches based on metric learning and fine-tuning. Impressive results are achieved with the fine-tuning method, which encourages tackling more complex scenarios such as Cross-Domain FSOD. We conduct preliminary experiments in this direction and obtain promising results. Finally, we address the deployment of the detection models inside COSE's systems. Detection must be done in real-time in extremely large images (more than 100 megapixels), with limited computation power. Leveraging existing optimization tools such as TensorRT, we successfully tackle this engineering challenge.
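The abstract does not give the form of its scale-adaptive box similarity criterion, but the motivation (plain IoU is brittle for tiny boxes, where a few pixels of offset collapse the overlap) can be illustrated with one plausible, purely hypothetical variant: blend IoU with a center-distance term, leaning on distance more as boxes get smaller. This is an illustration of the idea, not the paper's criterion:

```python
import math

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def scale_adaptive_similarity(a, b, ref_size=32.0):
    """Hypothetical scale-adaptive similarity: small boxes weight center
    distance, large boxes weight IoU. Not the criterion from the paper."""
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    dist = math.hypot(ca[0] - cb[0], ca[1] - cb[1])
    size = math.sqrt(max((a[2] - a[0]) * (a[3] - a[1]), 1e-9))
    closeness = math.exp(-dist / ref_size)   # 1.0 when centers coincide
    alpha = min(size / ref_size, 1.0)        # small box -> small alpha
    return alpha * iou(a, b) + (1 - alpha) * closeness
```

Under such a criterion, a tiny predicted box whose center is nearly right still receives meaningful credit even when its raw IoU is near zero, which is the property a small-object-friendly matcher needs during both training and evaluation.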
Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA
Zhu, Yongxin, Liu, Zhen, Liang, Yukang, Li, Xin, Liu, Hao, Bao, Changcun, Xu, Linli
In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images for question answering. Apart from text or visual objects, which could exist independently, scene text naturally links the text and visual modalities together by conveying linguistic semantics while simultaneously being a visual object in an image. Unlike conventional STVQA models, which treat the linguistic semantics and visual semantics of scene text as two separate features, we propose a paradigm of "Locate Then Generate" (LTG), which explicitly unifies these two semantics with the spatial bounding box as a bridge connecting them. Specifically, LTG first locates the region in an image that may contain the answer words with an answer location module (ALM) consisting of a region proposal network and a language refinement network, which map to each other one-to-one via the scene text bounding box. Next, given the answer words selected by the ALM, LTG generates a readable answer sequence with an answer generation module (AGM) based on a pre-trained language model. As a benefit of the explicit alignment of the visual and linguistic semantics, even without any scene-text-based pre-training tasks, LTG boosts absolute accuracy by +6.06% and +6.92% on the TextVQA dataset and the ST-VQA dataset respectively, compared with a non-pre-training baseline. We further demonstrate that LTG effectively unifies the visual and text modalities through the spatial bounding box connection, which is underappreciated in previous methods.
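The "locate" half of the pipeline can be caricatured in a few lines: given scene-text tokens with bounding boxes and (hypothetical) ALM scores, keep the high-scoring tokens and order them by reading position before handing them to a generator. Everything here (token format, threshold, sort order) is an assumed simplification, not the paper's implementation:

```python
def locate_then_select(ocr_tokens, threshold=0.5):
    """Toy 'locate' step: keep scene-text tokens scored above threshold by a
    hypothetical answer location module, then order them top-to-bottom,
    left-to-right via their bounding boxes. The real AGM would rewrite this
    selection into a fluent answer with a pre-trained language model."""
    picked = [t for t in ocr_tokens if t["score"] >= threshold]
    picked.sort(key=lambda t: (t["box"][1], t["box"][0]))  # box = (x1, y1, x2, y2)
    return " ".join(t["text"] for t in picked)

tokens = [
    {"text": "main", "box": (40, 10, 80, 24), "score": 0.9},
    {"text": "12",   "box": (10, 10, 30, 24), "score": 0.8},
    {"text": "exit", "box": (10, 40, 50, 54), "score": 0.2},
]
answer = locate_then_select(tokens)
```

The point of the sketch is the role of the bounding box: it is the one piece of information shared by the visual token and its text, which is what lets the region proposals and the language side be put in one-to-one correspondence.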
Dimensionality of datasets in object detection networks
Chawda, Ajay, Vierling, Axel, Berns, Karsten
In recent years, convolutional neural networks (CNNs) have been used in a large number of computer vision tasks, one of which is object detection for autonomous driving. Although CNNs are widely used in many areas, what happens inside the network remains unexplained on many levels. Our goal is to determine the effect of intrinsic dimension (i.e., the minimum number of parameters required to represent the data) in different layers on the accuracy of an object detection network for augmented datasets. Our investigation shows that there is a difference between the representation of normal and augmented data during feature extraction.
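There are several estimators of intrinsic dimension; one common linear one (which may differ from the estimator used in the paper) counts how many principal components are needed to capture almost all of the variance of a layer's activations. A self-contained sketch on synthetic data with a known latent dimension of 3:

```python
import numpy as np

def pca_intrinsic_dimension(x, var_threshold=0.99):
    """Rough intrinsic-dimension estimate: the smallest number of principal
    components whose cumulative explained variance reaches the threshold.
    A simple linear estimator, shown for illustration only."""
    centered = x - x.mean(axis=0)
    # Squared singular values of the centered data are the component variances.
    s = np.linalg.svd(centered, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, var_threshold) + 1)

rng = np.random.default_rng(0)
# 3 latent factors embedded in a 20-dimensional ambient space, plus tiny noise.
latent = rng.normal(size=(500, 3))
mix = rng.normal(size=(3, 20))
data = latent @ mix + 0.01 * rng.normal(size=(500, 20))
dim = pca_intrinsic_dimension(data)
```

Applied per layer, an estimator like this makes the paper's question concrete: if augmentation changes the intrinsic dimension of intermediate feature maps, the representation of normal and augmented data must differ during feature extraction.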
Centerpoints Are All You Need in Overhead Imagery
Inder, James Mason, Lowell, Mark, Maltenfort, Andrew J.
Every day, observation satellites capture terabytes of imagery of the Earth's surface that feed into a wide variety of civil and military applications. This stream of data has grown so large that only automated methods can feasibly analyze it. One critical component of remote sensing analysis is object detection: locating objects of interest on the Earth's surface in overhead imagery. Automated object detection algorithms have advanced by leaps and bounds over the last decade, but they still require vast amounts of labeled data for training, which is expensive and tedious to produce. Any technique that can reduce the resources needed to label objects in overhead imagery is therefore desirable. Most existing datasets for training overhead object detectors are labeled with horizontal bounding boxes [1][2][3][4][5], object-aligned bounding boxes [6][7][8][9][10], or segmentation masks [11][12].
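The contrast between these label types and the centerpoint labels the title advocates is easy to make concrete: a centerpoint is what remains when a horizontal box annotation is collapsed to its midpoint, which is why centerpoints are so much cheaper to produce. A trivial sketch of that conversion (function name is illustrative):

```python
def box_to_centerpoint(box):
    """Collapse a horizontal bounding-box label (x1, y1, x2, y2) into a
    centerpoint label (cx, cy) -- the cheaper annotation type: one click
    per object instead of two or more."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
```

The annotation saving comes from the reverse direction being unnecessary: a centerpoint-trained detector never needs box extents at labeling time at all.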